Abstract — Independent Component Analysis (ICA) is an effective unsupervised tool for learning statistically independent representations.
However, ICA is not only sensitive to whitening but also has difficulty learning an over-complete basis. Consequently, ICA with a soft
reconstruction cost (RICA) was presented to learn sparse representations with an over-complete basis even on unwhitened data. However,
RICA cannot represent data with nonlinear structure because of its intrinsic linearity. In addition, RICA is essentially an
unsupervised method and cannot utilize class information. In this paper, we propose a kernel ICA model with reconstruction
constraint (kRICA) to capture nonlinear features. To bring in class information, we further extend the unsupervised kRICA to a
supervised one by introducing a discrimination constraint, namely d-kRICA. This constraint leads to a structured basis consisting
of basis vectors from different basis subsets corresponding to different class labels. Each subset then sparsely represents its
own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, and thereby
the learned sparse representations have more discriminative power. Experimental results validate the effectiveness of kRICA and
d-kRICA for image classification.

Index Terms — Independent component analysis, nonlinear mapping, supervised learning, image classification. 



1 Introduction 

Sparsity is an attribute characterizing a mass of natural
and manmade signals [1], and has played a vital role in
the success of many machine learning algorithms and
techniques such as compressed sensing [2], matrix
factorization [3], sparse coding [4], dictionary learning [5],
[6], sparse auto-encoders [7], Restricted Boltzmann
Machines (RBMs) [8] and Independent Component Analysis
(ICA) [9].

Among these, ICA transforms an observed multidimensional
random vector into sparse components that are as
statistically independent of each other as possible.
Specifically, a general principle for estimating the independent
components is the maximization of non-gaussianity [9].
This is based on the central limit theorem: a sum of
independent random variables is closer to gaussian than
any of the original random variables, so the more
non-gaussian a component is, the more independent it is.
Meanwhile, sparsity is one form of non-gaussianity [10],
and it is dominant in natural images. Maximizing
sparseness in natural images is therefore essentially
equivalent to maximizing non-gaussianity. Thus, ICA has
been successfully applied to learn sparse representations
for classification tasks by maximizing sparsity [11].
However, there are two main drawbacks to standard ICA.

1) ICA is sensitive to whitening, which is an important
preprocessing step in ICA to extract efficient features. In
addition, it is difficult to exactly whiten high
dimensional data. For example, an input image of
100x100 pixels can be exactly whitened by principal
component analysis (PCA), but this requires solving the
eigen-decomposition of the 10,000 x 10,000 covariance
matrix.

2) It is hard for ICA to learn an over-complete basis (that
is, the number of basis vectors is greater than the
dimensionality of the input data). However, Coates et al. [12]
have shown that several approaches with an over-complete
basis, e.g., sparse autoencoders [7], K-means [12] and
RBMs [8], improve classification performance. This puts
ICA at a disadvantage compared to these methods.

Both drawbacks are mainly due to the hard orthonormality
constraint in standard ICA. Mathematically, that
is $WW^T = I$, which is utilized to prevent degenerate
solutions for the basis matrix $W$, where each basis
vector is a row of $W$. However, this orthonormality
cannot be satisfied when $W$ is over-complete. Specifically,
the optimization problem of standard ICA is generally
solved by gradient descent methods, where
$W$ is orthonormalized at each iteration by symmetric
orthonormalization, i.e., $W \leftarrow (WW^T)^{-1/2} W$, which
does not work for over-complete learning. In addition,
although alternative orthonormalization methods could
be employed to learn an over-complete basis, they are not
only expensive to compute but may also suffer from the
accumulation of errors.
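
The symmetric orthonormalization step can be illustrated with a minimal NumPy sketch. The function name, the dimensions and the random data are assumptions made for illustration only. The sketch shows that $WW^T = I$ is reached for a complete basis, while for an over-complete basis $WW^T$ is rank deficient, so $(WW^T)^{-1/2}$ does not exist.

```python
import numpy as np

def symmetric_orthonormalize(W):
    """Return (W W^T)^{-1/2} W, computed from an eigendecomposition of W W^T."""
    gram = W @ W.T                              # K x K, symmetric
    eigvals, eigvecs = np.linalg.eigh(gram)
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return inv_sqrt @ W

rng = np.random.default_rng(0)
n = 8                                           # input dimensionality (illustrative)

# Complete basis (K = n): the rows become orthonormal.
W_complete = rng.standard_normal((n, n))
W_orth = symmetric_orthonormalize(W_complete)
complete_error = np.abs(W_orth @ W_orth.T - np.eye(n)).max()

# Over-complete basis (K = 2n): W W^T is 2n x 2n but has rank at most n,
# so its inverse square root is undefined and the update cannot be applied.
W_over = rng.standard_normal((2 * n, n))
rank_over = np.linalg.matrix_rank(W_over @ W_over.T)
```

Here `rank_over` stays at `n` although the Gram matrix has `2 * n` rows, which is the algebraic reason the hard constraint cannot hold in the over-complete case.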

To address the above issues, Q.V. Le et al. [13]
replaced the orthonormality constraint with a robust soft
reconstruction cost for ICA (RICA). Thus, RICA can
learn sparse representations with a highly over-complete
basis even on unwhitened data. However, this model
is still a linear technique and therefore cannot
discover nonlinear relationships among the input data.
Additionally, as an unsupervised method, RICA does not
consider the association between a training sample and
its class, and may therefore not be sufficient for
classification tasks.

Recall that, to explore nonlinear features, the kernel
trick [14] can be used to nonlinearly project the input
data into a high dimensional feature space. Therefore, we
develop a kernel extension of RICA (kRICA) to represent
data with nonlinear structure. In addition, to bring
in label information, we further extend the unsupervised
kRICA to a supervised one by introducing a discrimination
constraint, namely d-kRICA. In particular, this
constraint jointly maximizes the homogeneous representation
cost and minimizes the inhomogeneous representation
cost, which leads to a structured basis
consisting of basis vectors from different basis subsets
corresponding to the class labels. Each subset will then
sparsely represent its own class well but not the
others. Furthermore, data samples belonging to the same
class will have similar representations, and thereby the
obtained sparse representations have more discriminative
power.

It is important to note that this work is fundamentally 
based on our previous work DRICA [15]. In comparison 
to DRICA, we further improve our work as follows: 

1) By taking advantage of the kernel trick, we replace
the linear projection with a nonlinear one to capture
nonlinear features. Experimental results show that our
kernel extension usually further improves the image
classification accuracy.

2) The discriminative capability of basis is further en- 
hanced by maximizing the homogeneous representation 
cost besides minimizing the inhomogeneous represen- 
tation cost simultaneously. Thus, we can obtain a set 
of more discriminative basis vectors that are forced to 
sparsely represent better for their own classes but poorer 
for the others. Experiments show that this basis can 
further boost the performance for image classification. 

3) In the experiments, we conduct comprehensive 
analysis for our proposed method, e.g., the effects of 
different parameters and kernels for image classification, 
experiment settings, and the similarity comparative anal- 
ysis. 

The rest of the paper is organized as follows. In Section 
2, we revisit related works on sparse coding and RICA, 
and describe the connection between them. Then we 
give a brief review of reconstruction ICA in Section 3. 
Section 4 introduces the details of our proposed kRICA, 
including its optimization problem and implementation. 
By incorporating the discrimination constraint, kRICA 
is further extended to supervised learning in Section 
5. Section 6 presents extensive experimental results on 
image classification. Finally, we conclude our work in 
Section 7. 



2 Related Work 

In this section, we review related work in the
following aspects: (1) sparse coding and its applications;
(2) the connection between RICA and sparse coding; (3) other
kernel sparse representation algorithms.

Sparse coding is an unsupervised method for reconstructing
a given signal by selecting a relatively small
subset of basis vectors from an over-complete basis set
while keeping the reconstruction error as small
as possible. Because of its plausible statistical theory [16],
sparse coding has attracted more and more attention
in the computer vision field. Meanwhile, it
has been successfully used in more and more computer
vision applications, e.g., image classification [13], [18],
[19], face recognition [20], image restoration [21], etc. This
success is largely due to two factors:

1) Sparsity exists ubiquitously in
many computer vision applications. For example, in image
classification, image components can be sparsely
reconstructed from similar components of other
images of the same class [17]. Another example is face
recognition: the face image to be tested can be accurately
reconstructed from a few training images of the same
category [20]. As a consequence, sparsity is the foundation
of these applications based on sparse coding.

2) Images are often corrupted by noise, which may
arise from sensor imperfection, poor illumination or
communication errors. Sparse coding can effectively
select the related basis vectors to reconstruct the
clean image, and can deal with noise by allowing
a reconstruction error and promoting sparsity.
Therefore, sparse coding has been successfully applied
to image denoising [22], image restoration [21], etc.

Similar to sparse coding, ICA with a reconstruction
cost (RICA) [13] can also learn highly over-complete
sparse representations. In addition, it has been
shown in [13] that RICA is mathematically equivalent to sparse
coding when using explicit encoding and ignoring the norm
ball constraint.

The above-mentioned studies only seek sparse
representations of the input data in the original data space,
and therefore cannot represent data with nonlinear
structure. To solve this problem, Yang et al. [23]
developed a two-phase kernel ICA algorithm: whitened
kernel principal component analysis (KPCA) plus ICA.
Different from [23], another solution [24] was proposed
that uses a contrast function based on canonical correlations
in a reproducing kernel Hilbert space. However, neither of
these methods can learn an over-complete sparse
representation of nonlinear features because of the
orthonormality constraint. Therefore, to find such a representation,
Gao et al. [25], [26] presented a kernel sparse coding
method (KSR) in a high dimensional feature space. However,
as an unsupervised approach, this work does not utilize
class information. Additionally, in Section 4.3, we
will show that our proposed kernel extension of RICA
(kRICA) is equivalent to KSR under certain conditions.

3 Reconstruction ICA

Since sparsity is one form of non-gaussianity, maximization
of sparsity for ICA is equivalent to maximization
of independence [10]. Given the unlabeled data set
$X = \{x_i\}_{i=1}^m$ where $x_i \in R^n$, the optimization problem of
standard ICA [9] is generally defined as

$$\min_W \frac{1}{m}\sum_{i=1}^{m} g(W x_i), \quad \text{s.t. } WW^T = I, \qquad (1)$$

where $g(\cdot)$ is a nonlinear convex function,
$W = [w_1, w_2, \ldots, w_K]^T \in R^{K \times n}$ is the basis matrix, $K$ is the
number of basis vectors, $w_j$ is the $j$-th row basis vector
in $W$, and $I$ is the identity matrix. Additionally, the
orthonormality constraint $WW^T = I$ is traditionally
utilized to prevent the basis vectors in $W$ from becoming
degenerate. Meanwhile, a good general purpose smooth
$L_1$ penalty is $g(\cdot) = \log(\cosh(\cdot))$ [10].

However, as pointed out above, the orthonormality
constraint makes it difficult for standard ICA to learn an
over-complete basis. In addition, ICA is sensitive to
whitening. These drawbacks prevent ICA from scaling to high
dimensional data. Consequently, RICA [13] used a soft
reconstruction cost to replace the orthonormality constraint
in ICA. Applying this replacement to Equation
(1), RICA can be formulated as the following unconstrained
problem

$$\min_W \frac{1}{m}\sum_{i=1}^{m}\Big(\|W^T W x_i - x_i\|_2^2 + \lambda g(W x_i)\Big), \qquad (2)$$

where the parameter $\lambda$ is a tradeoff between reconstruction
and sparsity. By swapping the orthonormality constraint
for a reconstruction penalty, RICA can learn
sparse representations even on data without whitening
when $W$ is over-complete.
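
As a minimal sketch of the RICA objective in Equation (2), the following NumPy code evaluates the reconstruction and sparsity terms for a basis stored row-wise. The function names, the row-major data layout and the test dimensions are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def log_cosh(z):
    """Numerically stable log(cosh(z)), the smooth L1 penalty g."""
    a = np.abs(z)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def rica_objective(W, X, lam):
    """(1/m) * sum_i ( ||W^T W x_i - x_i||_2^2 + lam * sum_j g(w_j x_i) ).

    W: (K, n) basis, one basis vector per row (K may exceed n).
    X: (m, n) data, one sample per row.
    """
    S = X @ W.T                       # (m, K): codes s_i = W x_i
    residual = S @ W - X              # (m, n): rows are W^T W x_i - x_i
    reconstruction = np.sum(residual ** 2, axis=1)
    sparsity = np.sum(log_cosh(S), axis=1)
    return np.mean(reconstruction + lam * sparsity)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
W_over = rng.standard_normal((12, 5))     # over-complete: K = 12 > n = 5
value = rica_objective(W_over, X, lam=0.1)
```

Because no constraint couples the rows of `W`, the same function applies unchanged to complete and over-complete bases, which is the point of replacing the hard constraint with a penalty.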

Furthermore, since the $L_1$ penalty is not sufficient to
learn invariant features [10], RICA [13], [11] replaced
it with an $L_2$ pooling penalty, which encourages pooling
features to group similar features together to achieve
complex invariances such as scale and rotational invariance.
Besides, $L_2$ pooling can also promote sparsity
for feature learning. In particular, $L_2$ pooling [28], [29] is a
two-layered network with a square nonlinearity in the first
layer and a square-root nonlinearity in the second layer:

$$g(W x_i) = \sum_{j=1}^{K}\sqrt{\varepsilon + H_j \cdot \big((W x_i) \odot (W x_i)\big)}, \qquad (3)$$

where $H_j$ is the $j$-th row of the spatial pooling matrix $H \in R^{K \times K}$
fixed to uniform weights, $\odot$ denotes element-wise multiplication, and
$\varepsilon$ is a small constant to prevent division by zero.
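
The two-layer structure of $L_2$ pooling (square, pool, square root) can be sketched as follows. The non-overlapping grouping used to build the pooling matrix is an assumption made for illustration, since the text only states that the weights are uniform; the function names are likewise hypothetical.

```python
import numpy as np

def l2_pooling_penalty(S, H, eps=1e-8):
    """sum_j sqrt(eps + H_j . (s * s)) for every code s stored as a row of S.

    S: (m, K) codes; H: (P, K) non-negative pooling matrix.
    """
    pooled = (S ** 2) @ H.T           # (m, P): entry (i, j) = sum_v h_jv * s_iv^2
    return np.sum(np.sqrt(eps + pooled), axis=1)

def uniform_pooling_matrix(K, group_size):
    """Assumed layout: each row pools one block of neighbouring features with weight 1."""
    H = np.zeros((K, K))
    for j in range(K):
        start = (j // group_size) * group_size
        H[j, start:start + group_size] = 1.0
    return H

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 6))
H = uniform_pooling_matrix(6, group_size=2)
penalty = l2_pooling_penalty(S, H)
```

With `H` equal to the identity and `eps = 0`, the penalty reduces to the plain $L_1$ norm of each code, which makes the relation between the two penalties explicit.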

Nevertheless, RICA cannot represent data
with nonlinear structure because of its intrinsic linearity.
In addition, this model simply learns the over-complete
basis set with a reconstruction cost and fails
to consider the association between a training sample
and its class, which may be insufficient for classification
tasks. To address these problems, on the one hand, we focus
on developing a kernel extension of RICA to find the
sparse representation of nonlinear features. On the other
hand, we aim to learn a more discriminative basis than
unsupervised RICA by bringing in class information,
which will improve the performance of the sparse
representation in classification tasks.

4 Kernel Extension for RICA 

Motivated by the success of the kernel trick in capturing
the nonlinear structure in data [14], we propose a kernel
version of RICA, called kRICA, to learn the sparse
representation of nonlinear features.

4.1 Model Formulation 

Suppose that there is a kernel function $\kappa(\cdot, \cdot)$ induced
by a high dimensional feature mapping $\phi : R^n \rightarrow R^N$,
where $n \ll N$. Given two data points $x_i$ and $x_j$,
$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ represents a nonlinear similarity
between them. The mapping $\phi$ takes the data and
basis from the original data space to the feature space
as follows.

$$x \rightarrow \phi(x), \qquad W = [w_1, \ldots, w_K]^T \rightarrow \mathcal{W} = [\phi(w_1), \ldots, \phi(w_K)]^T \qquad (4)$$

Furthermore, by substituting the mapped data and basis
into Equation (2), we obtain the following objective
function of kRICA.

$$\min_{\mathcal{W}} \frac{1}{m}\sum_{i=1}^{m}\Big[\|\mathcal{W}^T\mathcal{W}\phi(x_i) - \phi(x_i)\|_2^2 + \lambda g(\mathcal{W}\phi(x_i))\Big] \qquad (5)$$
Due to its excellent performance in many computer
vision applications [14], [25], the Gaussian kernel, i.e.,
$\kappa(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|_2^2)$, is used in this study.
Thus, the norm ball constraints on the basis in RICA can
be removed owing to $\phi(w_i)^T\phi(w_i) = \kappa(w_i, w_i) = 1$.
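
A small sketch of the Gaussian kernel makes the unit-norm property concrete: every mapped basis vector satisfies $\kappa(w_i, w_i) = 1$ regardless of the scale of $w_i$. The function name, the value of $\gamma$ and the random data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """kappa(a, b) = exp(-gamma * ||a - b||_2^2) for all rows a of A and b of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        - 2.0 * A @ B.T
        + np.sum(B ** 2, axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
W = 5.0 * rng.standard_normal((4, 3))     # basis vectors with arbitrary scale
X = rng.standard_normal((6, 3))

K_ww = gaussian_kernel(W, W, gamma=0.1)   # (4, 4), entries kappa(w_u, w_v)
K_wx = gaussian_kernel(W, X, gamma=0.1)   # (4, 6), entries kappa(w_u, x_i)
```

The diagonal of `K_ww` is identically one, so the mapped basis vectors already lie on the unit sphere of the feature space and no explicit norm constraint is needed.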

In addition, we perform kernel principal component
analysis (KPCA) in the feature space for data whitening,
similar to [23], which makes the problem of ICA estimation
simpler and better conditioned [10]. When the data
is whitened, there exists a close relationship between
kernel ICA [23] and kRICA. Regarding this relationship,
we have the following lemma:

Lemma 4.1 When the input data set $X = \{x_i\}_{i=1}^m$
is whitened in the feature space, the reconstruction cost
$\frac{1}{m}\sum_{i=1}^{m}\|\mathcal{W}^T\mathcal{W}\phi(x_i) - \phi(x_i)\|_2^2$ is equivalent to the
orthonormality cost $\|\mathcal{W}^T\mathcal{W} - I\|_F^2$,
where $\|\cdot\|_F$ is the Frobenius norm.

Lemma 4.1 shows
that kernel ICA's hard orthonormality constraint and
kRICA's reconstruction cost are equivalent when the data
is whitened. However, kRICA can learn an over-complete
sparse representation of nonlinear features, whereas kernel
ICA cannot because of the orthonormality constraint.
Please see Appendix A for a detailed proof.
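
The equivalence stated in Lemma 4.1 can be checked numerically in a simplified setting. The sketch below assumes the identity feature map (so $\phi(x) = x$) and whitens the second-moment matrix of synthetic data; these choices, and all variable names, are assumptions for illustration and do not replace the proof in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 500, 6, 10                      # K > n: over-complete basis

# Whiten so that the second-moment matrix (1/m) sum_i x_i x_i^T is the identity.
X_raw = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))
second_moment = X_raw.T @ X_raw / m
eigvals, eigvecs = np.linalg.eigh(second_moment)
X = X_raw @ eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

W = rng.standard_normal((K, n))
A = W.T @ W - np.eye(n)                   # W^T W - I

# (1/m) sum_i ||W^T W x_i - x_i||^2 versus ||W^T W - I||_F^2
reconstruction_cost = np.mean(np.sum((X @ A.T) ** 2, axis=1))
orthonormality_cost = np.sum(A ** 2)
```

The two costs agree up to rounding because the average of $x_i^T A^T A x_i$ equals the trace of $A^T A$ once the second moment is the identity.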
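
4.2 Implementation

Equation (5) is an unconstrained convex optimization
problem. To solve this problem, we rewrite the
objective as follows

$$
f(\mathcal{W}) = \frac{1}{m}\sum_{i=1}^{m}\Big[\|\mathcal{W}^T\mathcal{W}\phi(x_i) - \phi(x_i)\|_2^2 + \lambda g(\mathcal{W}\phi(x_i))\Big]
$$
$$
= \frac{1}{m}\sum_{i=1}^{m}\Big[1 + \sum_{u=1}^{K}\sum_{v=1}^{K}\kappa(w_u, x_i)\,\kappa(w_u, w_v)\,\kappa(w_v, x_i) - 2\sum_{u=1}^{K}\big(\kappa(w_u, x_i)\big)^2 + \lambda\sum_{j=1}^{K}\sqrt{\varepsilon + \sum_{v=1}^{K} h_{jv}\big(\kappa(w_v, x_i)\big)^2}\Big], \qquad (6)
$$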

where $w_u$ and $w_v$ are rows of the basis $W$, and $h_{jv}$ is an
element of the pooling matrix $H$. Since each row $w_p$ of $W$ is
contained in the kernel $\kappa(w_p, \cdot)$, it is very hard to directly
apply the optimization methods used in RICA, e.g., L-BFGS
and CG [30], to compute the optimal basis. Thus, to solve
this problem, we instead optimize the rows of the basis
$W$ alternately. For each row $w_p$ of $W$ being updated, we compute
the derivative $\partial f / \partial w_p$ of $f(\mathcal{W})$, in which the pooling term
contributes factors of the form $\sqrt{\varepsilon + \sum_{v=1}^{K} h_{jv}(\kappa(w_v, x_i))^2}$.
Then, to compute the optimal $w_p$, we set

$$\frac{\partial f}{\partial w_p} = 0. \qquad (7)$$
Since $w_p$ is contained in $\kappa(w_p, \cdot)$, it is challenging to
solve Equation (7) exactly. Thus, we seek an approximate
solution instead. Inspired by the fixed
point algorithm [25], to update $w_p$ in the $q$-th iteration,
we use the value of $w_p$ from the $(q-1)$-th iteration to
calculate the part inside the kernel function. In addition,
following [25], we use k-means to initialize the basis.
Denote $w_p$ in the $q$-th iteration by $w_{p,(q)}$. In
Equation (7) with respect to $w_{p,(q)}$, every kernel value
involving $w_p$ is then evaluated at the previous iterate, e.g.,
$\kappa(w_{p,(q-1)}, x_i)$ and $\kappa(w_{p,(q-1)}, w_v)$, while the factors outside
the kernel, such as $(w_{p,(q)} - x_i)$, are kept in terms of the
current iterate, and the resulting expression is set to 0.

When all the remaining rows are fixed, the problem
becomes a linear equation in $w_{p,(q)}$, which can be solved
straightforwardly.
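
The kernel expansion of the reconstruction term in Equation (6) only needs kernel values, never the mapped vectors themselves. The sketch below evaluates it from kernel matrices; the function name and argument layout are assumptions. For the Gaussian kernel, $\kappa(x_i, x_i) = 1$, which gives the constant 1 in Equation (6). To allow a direct comparison with the explicit computation, the usage example deliberately swaps in a linear kernel, for which $\phi$ is the identity.

```python
import numpy as np

def kernel_reconstruction_cost(K_wx, K_ww, k_xx):
    """Per-sample ||W^T W phi(x_i) - phi(x_i)||^2 written with kernel values only.

    K_wx: (K, m) with entries kappa(w_u, x_i)
    K_ww: (K, K) with entries kappa(w_u, w_v)
    k_xx: (m,)   with entries kappa(x_i, x_i)
    """
    # sum_u sum_v kappa(w_u, x_i) kappa(w_u, w_v) kappa(w_v, x_i)
    quadratic = np.sum(K_wx * (K_ww @ K_wx), axis=0)
    # -2 sum_u kappa(w_u, x_i)^2
    cross = -2.0 * np.sum(K_wx ** 2, axis=0)
    return k_xx + quadratic + cross

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 4))
X = rng.standard_normal((10, 4))

# Linear kernel (identity feature map), used only to verify the expansion.
kernel_value = kernel_reconstruction_cost(W @ X.T, W @ W.T, np.sum(X ** 2, axis=1))
explicit_value = np.sum((X @ W.T @ W - X) ** 2, axis=1)
```

Both arrays agree, which confirms that the double sum, the cross term and the constant reproduce the squared reconstruction error term by term.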



4.3 Connection between kRICA and KSR 

It is clear that there is a close connection between the
proposed kRICA and KSR [25]. Similar to kRICA, KSR
attempts to find the sparse representation of nonlinear
features in a high dimensional feature space, and its
optimization problem is

$$\min_{\mathcal{W}, s_i} \frac{1}{m}\sum_{i=1}^{m}\Big[\|\mathcal{W}^T s_i - \phi(x_i)\|_2^2 + \lambda\|s_i\|_1\Big] \qquad (8)$$

where $s_i \in R^K$ is the sparse representation of sample
$x_i$. There are two major differences between
them.

(1) kRICA utilizes explicit encoding for the sparse
representation of each input data sample, i.e.,
$s_i = \mathcal{W}\phi(x_i)$, whereas KSR treats the codes $s_i$ as free
variables. Since the objective of Equation (8) in KSR is
not convex, the basis $\mathcal{W}$ and the sparse codes $s_i$ have to be
optimized alternately.

(2) The simple $L_1$ penalty, $g(s_i) = \|s_i\|_1$, is employed
by KSR to promote sparsity, while kRICA uses $L_2$ pooling
instead, which forces the pooling features to
group similar features together to achieve invariance
while also optimizing sparsity.

5 Supervised Kernel RICA 

Given the labeled training data, our goal is to utilize
class information to learn a structured basis set, which
consists of basis vectors from different basis subsets
corresponding to different class labels. Each subset
will then sparsely represent its own class well but not
the others. Thus, to learn such a basis, we further
extend the unsupervised kRICA to a supervised one
by introducing a discrimination constraint, namely
d-kRICA.

Mathematically, when the sample $x_i$ is labeled as
$y_i \in \{1, \ldots, c\}$, where $c$ is the total number of classes, we
can further utilize class information to learn a structured
basis set $W = [W^{(1)}, W^{(2)}, \ldots, W^{(c)}]^T \in R^{K \times n}$, where
$W^{(y_i)} \in R^{k \times n}$ is the basis subset that can well represent
the sample $x_i$ belonging to the $y_i$-th class rather than the
others, $k$ is the number of basis vectors for each subset
and $K = k \times c$. Let $s_i = W x_i$, where $s_i$ can be
regarded as the sparse representation of sample $x_i$ [13].
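
To make the block structure concrete, the sketch below builds the 0/1 vectors that select the coefficients of one class subset inside a code of length $K = k \times c$, and evaluates a discrimination term of the form used in Section 5.1. Treating the selection vectors as row vectors applied by an inner product is an assumption of this sketch, as are all function names and the chosen value of $\eta$.

```python
import numpy as np

def selection_vectors(label, num_classes, k):
    """Return (d_plus, d_minus): 0/1 vectors of length K = k * num_classes.

    d_plus marks the k coefficients of the subset for `label` (1-based, as in
    the text); d_minus marks all remaining coefficients.
    """
    d_plus = np.zeros(k * num_classes)
    d_plus[(label - 1) * k: label * k] = 1.0
    return d_plus, 1.0 - d_plus

def discrimination_term(s, label, num_classes, k, eta):
    """Assumed form: (d_minus . s)^2 - (d_plus . s)^2 + eta * ||s||_2^2."""
    d_plus, d_minus = selection_vectors(label, num_classes, k)
    return (d_minus @ s) ** 2 - (d_plus @ s) ** 2 + eta * (s @ s)

# Three classes with two basis vectors each, sample labeled as class 3.
d_plus, d_minus = selection_vectors(label=3, num_classes=3, k=2)

eta = 3.5                                          # illustrative, larger than k + 1
own_class_code = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])
other_class_code = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
own_value = discrimination_term(own_class_code, 3, 3, 2, eta)
other_value = discrimination_term(other_class_code, 3, 3, 2, eta)
```

A code concentrated on the sample's own subset receives a lower value than a code of the same energy concentrated on another subset, which is the behaviour the supervised extension aims for.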

5.1 Discrimination constraint 

Since we aim to utilize class information to learn a
structured basis, we hope that a sample $x_i$ labeled as $y_i$
will only be reconstructed by the basis subset $W^{(y_i)}$ with
coefficients $s_i$. To achieve this goal, an inhomogeneous
representation cost constraint [15], [31] was utilized to
minimize the inhomogeneous representation coefficients
of $s_i$, i.e., the coefficients corresponding to basis vectors
that do not belong to $W^{(y_i)}$. However, this constraint
only minimizes the inhomogeneous coefficients
and fails to consider maximizing the homogeneous
ones, which is not sufficient to learn an optimal
structured basis. Consequently, to learn such a basis, we
introduce a discrimination constraint, which jointly maximizes
the homogeneous representation cost and minimizes the
inhomogeneous representation cost. Mathematically,
we define the homogeneous cost as $P_+$ and the
inhomogeneous cost as $P_-$. Specifically, $P_+$ and $P_-$ are

$$P_+ = \|D_{+y_i} s_i\|_2^2, \qquad P_- = \|D_{-y_i} s_i\|_2^2, \qquad (9)$$

where $D_{+y_i} \in R^K$ and $D_{-y_i} \in R^K$ select the
homogeneous and inhomogeneous representation coefficients
of $s_i$, respectively. For example, assuming
$W = [W^{(1)}, W^{(2)}, W^{(3)}]^T$, $W^{(y_i)} \in R^{2 \times n}$ ($y_i \in \{1, 2, 3\}$) and
$y_i = 3$, $D_{+y_i}$ and $D_{-y_i}$ can be respectively defined as
follows.

$$D_{+3} = [0 \;\; 0 \;\; 0 \;\; 0 \;\; 1 \;\; 1], \qquad D_{-3} = [1 \;\; 1 \;\; 1 \;\; 1 \;\; 0 \;\; 0]$$

Intuitively, we can define the discrimination constraint
function $d(s_i)$ as $P_- - P_+$, which means the sparse
representation $s_i$ in terms of the basis matrix $W$ will only
concentrate on the basis subset $W^{(y_i)}$. However, this
constraint is non-convex and unstable. To address the
problem, we propose to incorporate an elastic term $\|s_i\|_2^2$
into $d(s_i)$. Thus, $d(s_i)$ is defined as

$$d(s_i) = \|D_{-y_i} s_i\|_2^2 - \|D_{+y_i} s_i\|_2^2 + \eta\|s_i\|_2^2. \qquad (10)$$

It can be proved that if $\eta > k + 1$, $d(s_i)$ is strictly convex
in $s_i$. Please see Appendix B for a detailed proof.
The constraint (10) simultaneously maximizes the homogeneous
representation cost and minimizes the inhomogeneous
representation cost, which leads to a
structured basis consisting of basis vectors from different
basis subsets corresponding to the class labels. Each
subset will then sparsely represent its own class well but
not the others. Furthermore, data samples belonging
to the same class will have similar representations, and
thereby the obtained new representations have more
discriminative power.

By incorporating the discrimination constraint into the
kRICA framework (d-kRICA), we obtain the following
objective function

$$\min_{\mathcal{W}} \frac{1}{m}\sum_{i=1}^{m}\Big[\|\mathcal{W}^T\mathcal{W}\phi(x_i) - \phi(x_i)\|_2^2 + \lambda g(\mathcal{W}\phi(x_i)) + \alpha d(\mathcal{W}\phi(x_i))\Big], \qquad (11)$$

where $\lambda$ and $\alpha$ are scalars controlling the relative
contribution of the corresponding terms. Given a test
sample, Equation (11) means that the learned basis set
can sparsely represent it with nonlinear structure while
demanding that its homogeneous representations be as large as
possible and its inhomogeneous representations
as small as possible. Following kRICA, the optimization
problem (11) can be easily solved by the fixed point
algorithm proposed above.

TABLE 1
Image classification accuracy on Caltech 101 dataset.

Training size     15       30
ScSPM [17]        67.0%    73.2%
D-KSVD [6]        65.1%    73.0%
LC-KSVD [19]      67.7%    73.6%
RICA [13]         67.1%    73.7%
KICA [23]         65.2%    72.8%
KSR [25]          67.9%    75.1%
DRICA [15]        67.8%    74.4%
d-RICA            68.7%    75.6%
kRICA             68.2%    75.4%
d-kRICA           71.3%    77.1%

6 Experiments



In this section, we first introduce the feature
extraction for image classification. Then, we evaluate
the performance of our kRICA and d-kRICA for image
classification on three public datasets: Caltech 101 [32],
CIFAR-10 [12] and STL-10 [12]. Furthermore, we study
the selection of tuning parameters and kernel functions
for our method. Finally, we give the similarity matrix
to further illustrate the performance of kRICA and
d-kRICA.



6.1 Feature Extraction for Classification 

Given a $p \times p$ input image patch (with $d$ channels) $x \in R^n$
($n = p \times p \times d$), kRICA can transform it to a new
representation $s = \mathcal{W}\phi(x) \in R^K$ in the feature space,
where $p$ is termed the 'receptive field size'. For an
image of $N \times M$ pixels (with $d$ channels), we can
obtain an $(N - p + 1) \times (M - p + 1)$ (with $K$ channels)
feature following the same setting as in [13], by estimating
the representation for each $p \times p$ 'subpatch' of the input
image. To reduce the dimensionality of the image
representation, we utilize a pooling method similar to [13] to
form a reduced $4K$-dimensional pooled representation
for image classification. Given the pooled feature for
each image, we utilize a linear SVM for classification.
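
The pipeline above can be sketched end to end in NumPy: dense patch extraction, per-patch encoding with Gaussian kernel responses $\kappa(w_j, x)$, and pooling into a $4K$-dimensional vector. Sum pooling over the four image quadrants is an assumption of this sketch (the text only says the pooling is similar to [13]), and the function names, image size and $\gamma$ are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(image, p):
    """All p x p sub-patches of an (N, M, d) image as flat vectors.

    Returns an array of shape (N - p + 1, M - p + 1, p * p * d).
    """
    windows = sliding_window_view(image, (p, p), axis=(0, 1))  # (N-p+1, M-p+1, d, p, p)
    return windows.reshape(windows.shape[0], windows.shape[1], -1)

def encode(patches, W, gamma):
    """Map every patch x to [kappa(w_1, x), ..., kappa(w_K, x)] (Gaussian kernel)."""
    flat = patches.reshape(-1, patches.shape[-1])
    sq_dists = (
        np.sum(flat ** 2, axis=1)[:, None]
        - 2.0 * flat @ W.T
        + np.sum(W ** 2, axis=1)[None, :]
    )
    codes = np.exp(-gamma * np.maximum(sq_dists, 0.0))
    return codes.reshape(patches.shape[0], patches.shape[1], -1)

def quadrant_pool(features):
    """Sum-pool an (R, C, K) feature map over its four quadrants into a 4K vector."""
    rows, cols, _ = features.shape
    r, c = rows // 2, cols // 2
    quadrants = [features[:r, :c], features[:r, c:], features[r:, :c], features[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])

rng = np.random.default_rng(0)
image = rng.random((12, 10, 3))                # N = 12, M = 10, d = 3
p, K = 4, 5
W = rng.random((K, p * p * 3))                 # illustrative basis, one row per vector

patches = extract_patches(image, p)            # (9, 7, 48)
features = encode(patches, W, gamma=0.01)      # (9, 7, 5)
pooled = quadrant_pool(features)               # (20,) = 4K
```

The pooled vector is what a linear SVM would receive for each image; its length depends only on $K$, not on the image size.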



6.2 Classification on Caltech 101 

The Caltech 101 dataset consists of 9144 images which are
divided among 101 object classes and 1 background class,
including animals, vehicles, etc. Following the common
experiment setup [17], we run our algorithm
on 15 and 30 training images per category, respectively,
with basis size K = 1020 and 10x10 receptive fields.
Comparison results are shown in Table 1. We compare
our classification accuracy with ScSPM [17], D-KSVD [6],
LC-KSVD [19], RICA [13], KICA [23], KSR [25] and
DRICA [15]. In addition, in order to compare with
DRICA, we incorporate the discrimination constraint
(10) into the RICA framework (2), namely d-RICA.

Table 1 shows that kRICA and d-kRICA outperform
the other competing approaches.



JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007



TABLE 2
Test classification accuracy on CIFAR-10 dataset.

Model                                       Accuracy
Improved Local Coord. Coding [18]           74.5%
Conv. Deep Belief Net (2 layers) [33]       78.9%
Sparse auto-encoder [12]                    73.4%
Sparse RBM [12]                             72.4%
K-means (Hard) [12]                         68.6%
K-means (Triangle) [12]                     77.9%
K-means (Triangle, 4000 features) [12]      79.6%
RICA [13]                                   81.4%
KICA [23]                                   78.3%
KSR [25]                                    82.6%
DRICA [15]                                  82.1%
d-RICA                                      82.9%
kRICA                                       83.4%
d-kRICA                                     84.5%



6.3 Classification on CIFAR-10 

The CIFAR-10 dataset includes 10 categories and 60000
32x32 color images in all, with 6000 images per category,
such as airplane, automobile, truck and horse. In
addition, there are 50000 training images and 10000
testing images. Specifically, 1000 images from each class
are randomly selected as test images and the other 5000
images from each class as training images. In this
experiment, we fix the size of the basis set to 4000 with 6x6
receptive fields, following [12]. We compare our approach
with RICA, K-means (Triangle, 4000 features) [12], KSR,
DRICA, d-RICA, etc.

Table 2 shows the effectiveness of our proposed kRICA
and d-kRICA.

Fig. 1. Classification performance on STL-10 dataset with
varying basis size and 8x8 receptive fields.

TABLE 3
Test classification accuracy on STL-10 dataset.

Model                                       Accuracy
Raw pixels [12]                             31.8%
K-means (Triangle, 1600 features) [12]      51.5%
RICA (8x8 receptive fields) [13]            51.4%
RICA (10x10 receptive fields) [13]          52.9%
KICA [23]                                   51.1%
KSR [25]                                    54.4%
DRICA [15]                                  54.2%
d-RICA                                      54.8%
kRICA                                       55.2%
d-kRICA                                     56.9%



6.4 Classification on STL-10 

In STL-10, there are 10 classes (e.g., airplane, dog, monkey
and ship), where each image is a 96x96 pixel
color image. In addition, this dataset is divided into 500 training
images (10 pre-defined folds) and 800 test images per class,
plus 100,000 unlabeled images for unsupervised learning.
In our experiments, we set the size of the basis set to K = 1600
with 8x8 receptive fields, in the same manner as described
in [13].

Table 3 shows the classification results of raw pixels
[12], K-means, RICA, KSR, DRICA, d-RICA, kRICA
and d-kRICA.

As can be seen, d-RICA achieves better performance
than DRICA on all of the above datasets. This is because
DRICA only minimizes the inhomogeneous
representation cost for structured basis learning, while
d-RICA simultaneously maximizes the homogeneous
representation cost and minimizes the inhomogeneous
representation cost, which gives the learned sparse
representation more discriminative power. Although
both DRICA and d-RICA introduce class information,
unsupervised kRICA still performs better than both of
these algorithms. This means that kRICA gains more
discriminative power for classification by representing
the data with nonlinear structure. Additionally, since
kRICA utilizes $L_2$ pooling instead of the $L_1$ penalty to
achieve feature invariance, it demonstrates better
performance than KSR. Furthermore, d-kRICA achieves
better performance than kRICA in all cases by bringing
in class information.

We also investigate the effect of basis size for our
proposed kRICA and d-kRICA on the STL-10 dataset. In our
experiments, we try seven sizes: 50, 100, 200, 400, 800,
1200 and 1600. As shown in Fig. 1, the classification
accuracies of d-kRICA and kRICA continue to increase as
the basis size goes up to 1600, although the gains
are only slight beyond a basis size of 800. In particular,
d-kRICA outperforms all the other algorithms at every basis size.



6.5 Tuning Parameter and Kernel Selection 

In the experiments, the tuning parameters in kRICA and
d-kRICA, i.e., $\lambda$, $\alpha$ and $\gamma$ in the objective function, are
selected by cross validation to avoid over-fitting. More
specifically, we experimentally set these parameters as
follows.

The effect of $\lambda$: The parameter $\lambda$ is the weight of the
sparsity term, which is an important factor in kRICA.
To facilitate the parameter selection, we experimentally
investigate how the performance of kRICA varies with
the parameter $\lambda$ on the STL-10 dataset in Fig. 2 ($\gamma$ = 10^^).

Fig. 2. The relationship between the weight of sparsity
term ($\lambda$) and classification accuracy on STL-10 dataset.

Fig. 3. The relationship between the weight of discrimination
constraint term ($\alpha$) and classification accuracy on
STL-10 dataset.



Fig. 2 shows that kRICA achieves its best performance
when $\lambda$ is fixed to 10^^. Thus, we set $\lambda$ = 10^^ for
STL-10 data. In addition, we test the accuracy of RICA
under the same sparsity weight. It is easy to see that
our proposed nonlinear RICA (kRICA) consistently
outperforms linear RICA with respect to $\lambda$. Similarly,
we experimentally set $\lambda$ = 10^^ for Caltech data and
$\lambda = 10^{-2}$ for CIFAR-10 data.

The effect of $\alpha$: The parameter $\alpha$ controls the weight
of the discrimination constraint term. When $\alpha = 0$, the
supervised d-kRICA optimization problem becomes the
unsupervised kRICA problem. Fig. 3 shows the
relationship between the weight of the discrimination constraint
term $\alpha$ and classification accuracy on STL-10. We
can see that d-kRICA achieves its best performance when
$\alpha$ = 10^^. Hence, we set $\alpha$ = 10^^ for STL-10 data.
In particular, d-RICA achieves better performance than
DRICA over a wide range of $\alpha$ values. This is because
DRICA only minimizes the inhomogeneous
representation cost, while d-RICA jointly optimizes both
the homogeneous and inhomogeneous representation
costs for basis learning, which gives the learned sparse
representations more discriminative power. Furthermore,
by representing the data with nonlinear structure,
d-kRICA gains more discriminative power for
classification and outperforms both of these algorithms. Similarly,
we set $\alpha = 1$ for Caltech data and $\alpha$ = 10^^ for CIFAR-10
data.

The effect of $\gamma$: When we use the Gaussian kernel
in kRICA, it is vital to select the kernel parameter $\gamma$,
which affects the image classification accuracy. Fig. 4
shows the relationship between $\gamma$ and classification
accuracy on the STL-10 dataset. Based on this result, we set
$\gamma$ = 10^^ for the STL-10 data. Similarly, we experimentally
set $\gamma$ = 10^^ for the Caltech data and $\gamma$ = 10^^ for the
CIFAR-10 data.
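
The parameter studies above each vary one hyperparameter over a logarithmic grid while the others stay fixed. A minimal sketch of that procedure is given here. The function `evaluate` passed to `sweep` is a hypothetical stand-in for training the model and measuring held-out accuracy, and the toy scoring function and grid bounds are assumptions chosen for illustration, not values from the experiments.

```python
import math


def log_grid(low_exp, high_exp):
    """Return [10**low_exp, ..., 10**high_exp] in unit exponent steps."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def sweep(evaluate, grid):
    """Score every grid value; return (best_value, best_score, all_scores)."""
    scores = [(value, evaluate(value)) for value in grid]
    best_value, best_score = max(scores, key=lambda pair: pair[1])
    return best_value, best_score, scores


def toy_accuracy(value):
    # Hypothetical placeholder that peaks at 10**-1. A real run would train
    # kRICA or d-kRICA with this value and return validation accuracy.
    return 1.0 / (1.0 + (math.log10(value) + 1.0) ** 2)


best, score, table = sweep(toy_accuracy, log_grid(-4, 1))
```

Sweeping on exponents rather than raw values keeps the grid evenly spaced on the logarithmic axis used in the accuracy plots.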

We also investigate the effect of different kernels for
kRICA in image classification, i.e., the Polynomial kernel
$(1 + x^T y)^b$, the Inverse Distance kernel
$\frac{1}{1 + b\|x - y\|}$, the Inverse Square Distance kernel
$\frac{1}{1 + b\|x - y\|^2}$, and the Exponential Histogram
Intersection kernel $\sum_i \min(e^{b x_i}, e^{b y_i})$.¹ Table 4
reports the classification performance of the different
kernels on the STL-10 dataset; the Gaussian kernel
outperforms the other kernels. Thus, we employ the Gaussian
kernel in our studies.

Fig. 4. Classification performance on the STL-10 dataset with
varying kernel parameter ($\gamma$) in the Gaussian kernel.

1. Following the work [26], we set b=3 for the Polynomial kernel and
b=1 for the others.

TABLE 4

Classification performances of different kernels on the
STL-10 dataset

Kernel                                        Accuracy
Polynomial kernel                             54.2%
Inverse Distance kernel                       38.3%
Inverse Square Distance kernel                47.6%
Exponential Histogram Intersection kernel     36.5%
Gaussian kernel                               56.9%
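
The kernels compared in Table 4 can be sketched in a few lines. The block below is an illustration under stated assumptions rather than the code used in the experiments: the formulas follow the common definitions of these kernels with the footnote's values of b, the Gaussian kernel is assumed to be parameterized as exp(-gamma * ||x - y||^2), and all function names are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, b=3):
    return (1.0 + dot(x, y)) ** b


def inverse_distance_kernel(x, y, b=1.0):
    return 1.0 / (1.0 + b * euclidean(x, y))


def inverse_square_distance_kernel(x, y, b=1.0):
    return 1.0 / (1.0 + b * euclidean(x, y) ** 2)


def exp_hist_intersection_kernel(x, y, b=1.0):
    return sum(min(math.exp(b * a), math.exp(b * c)) for a, c in zip(x, y))


def gaussian_kernel(x, y, gamma=1.0):
    # Assumed parameterization: exp(-gamma * ||x - y||^2).
    return math.exp(-gamma * euclidean(x, y) ** 2)


def gram_matrix(samples, kernel):
    """Kernel (Gram) matrix with entries kernel(x_i, x_j)."""
    return [[kernel(u, v) for v in samples] for u in samples]


x, y = [1.0, 0.0], [0.0, 1.0]
G = gram_matrix([x, y], gaussian_kernel)
```

A kernelized method only ever touches the data through such a Gram matrix, so swapping kernels amounts to passing a different function to `gram_matrix`.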
6.6 Similarity Analysis

In the above sections, we have shown the effectiveness of
kRICA and d-kRICA for image classification. To further
illustrate their performance, we first choose 90 images
from three classes in Caltech 101, with 30 images per
class. Then we compute the similarity between the sparse
representations of these images for RICA, kRICA and
d-kRICA, respectively. Fig. 5 shows the resulting
similarity matrices. Each element (i, j) of a similarity
matrix is the similarity between the sparse representations
of images i and j, measured by Euclidean distance. Since a
good sparse representation method makes representations
belonging to the same class more similar, its similarity
matrix should be block-wise. Fig. 5 shows that nonlinear
kRICA has more discriminative power than linear RICA, and
d-kRICA performs best by bringing in class information.
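
The similarity analysis can be sketched as follows. This is a minimal illustration, not the evaluation code behind Fig. 5: the sparse codes are made-up toy vectors, and `block_contrast` is a hypothetical summary statistic (mean between-class distance minus mean within-class distance) added here only to quantify how block-wise a matrix is.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(codes):
    """Entry (i, j) is the Euclidean distance between codes i and j."""
    return [[euclidean(u, v) for v in codes] for u in codes]


def block_contrast(dist, labels):
    """Mean between-class distance minus mean within-class distance."""
    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            bucket = within if labels[i] == labels[j] else between
            bucket.append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)


# Toy sparse codes: two samples for each of three classes (assumed data).
codes = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
labels = [0, 0, 1, 1, 2, 2]
dist = distance_matrix(codes)
contrast = block_contrast(dist, labels)
```

When samples are ordered by class, small within-class distances form the diagonal blocks of `dist`, and a larger `contrast` corresponds to a more clearly block-wise matrix.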

7 Conclusions 

In this paper, we propose a kernel ICA model with
reconstruction constraint (kRICA) to capture nonlinear
features. To bring in class information, we further
extend the unsupervised kRICA to a supervised one by
introducing a discrimination constraint. This constraint
leads to a structured basis consisting of basis vectors
from different basis subsets corresponding to different
class labels. Each subset then sparsely represents its own
class well but not the others. Furthermore, data samples
belonging to the same class have similar representations,
and thereby the obtained sparse representations have more
discriminative power. The experiments conducted on standard
datasets demonstrate the effectiveness of our proposed method.

Appendix A 

Proof of Lemma 4.1 

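Proof

Since the input data set $X = \{x_i\}_{i=1}^{m}$ is whitened in
the feature space by KPCA, we have

$$E[\phi(x)\phi(x)^T] = \frac{1}{m}\sum_{i=1}^{m} \phi(x_i)\phi(x_i)^T = I,$$

where $I$ is the identity matrix. Furthermore, we can
obtain

$$
\begin{aligned}
\frac{1}{m}\sum_{i=1}^{m} \|W^T W \phi(x_i) - \phi(x_i)\|_2^2
&= \frac{1}{m}\sum_{i=1}^{m} \mathrm{Tr}\big[\phi(x_i)^T (W^T W - I)^T (W^T W - I)\,\phi(x_i)\big] \\
&= \mathrm{Tr}\Big[(W^T W - I)^T (W^T W - I)\,\frac{1}{m}\sum_{i=1}^{m} \phi(x_i)\phi(x_i)^T\Big] \\
&= \|W^T W - I\|_F^2,
\end{aligned}
$$

where $\mathrm{Tr}[\cdot]$ denotes the trace of a matrix, and the
derivation uses the matrix property $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$.
Thus, the reconstruction cost is equivalent to the
orthonormality constraint when the data is whitened in
the feature space.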
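
The identity in this proof can be checked numerically. The sketch below makes two simplifying assumptions that are not part of the paper: the feature map is taken to be the identity (a linear stand-in for $\phi$), and the whitened data set is constructed by hand as the 2d points $\pm\sqrt{d}\,e_j$, whose second-moment matrix is exactly $I$. The matrix `W` is an arbitrary example.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def whitened_samples(d):
    """2d points +/- sqrt(d) * e_j: zero mean and (1/m) * sum x x^T = I."""
    scale = math.sqrt(d)
    samples = []
    for j in range(d):
        for sign in (1.0, -1.0):
            samples.append([sign * scale if i == j else 0.0 for i in range(d)])
    return samples


def reconstruction_cost(W, samples):
    """(1/m) * sum_i ||W^T W x_i - x_i||_2^2."""
    WtW = matmul(transpose(W), W)
    total = 0.0
    for x in samples:
        recon = [sum(w * xi for w, xi in zip(row, x)) for row in WtW]
        total += sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return total / len(samples)


def orthonormality_cost(W):
    """||W^T W - I||_F^2."""
    WtW = matmul(transpose(W), W)
    n = len(WtW)
    return sum((WtW[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))


# Arbitrary over-complete example: K = 4 basis vectors in d = 3 dimensions.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4], [0.0, 0.6, 0.7], [-0.9, 0.1, 0.2]]
X = whitened_samples(3)
lhs = reconstruction_cost(W, X)
rhs = orthonormality_cost(W)
```

The two costs agree up to floating-point error for any `W`, because the sample second-moment matrix that appears inside the trace is the identity.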



Appendix B 

Proof of the convexity of $d(s_i)$

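We rewrite Equation (10) as

$$
\begin{aligned}
d(s_i) &= \|D_{-y_i} s_i\|_2^2 - \|D_{+y_i} s_i\|_2^2 + \eta \|s_i\|_2^2 \\
&= \mathrm{Tr}\big[s_i^T D_{-y_i}^T D_{-y_i} s_i - s_i^T D_{+y_i}^T D_{+y_i} s_i + \eta\, s_i^T s_i\big].
\end{aligned}
\tag{12}
$$

Then, we can obtain its Hessian matrix $\nabla^2 d$ with respect
to $s_i$:

$$\nabla^2 d = 2 D_{-y_i}^T D_{-y_i} - 2 D_{+y_i}^T D_{+y_i} + 2\eta I. \tag{13}$$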



Without loss of generality, we assume 




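After some derivations, we have $\nabla^2 d = 2A$, where

$$
A = \begin{bmatrix}
\eta+1 & \cdots & 1 & 0 & \cdots & 0 & 1 & \cdots & 1 \\
\vdots & \ddots & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
1 & \cdots & \eta+1 & 0 & \cdots & 0 & 1 & \cdots & 1 \\
0 & \cdots & 0 & \eta-1 & \cdots & -1 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & -1 & \cdots & \eta-1 & 0 & \cdots & 0 \\
1 & \cdots & 1 & 0 & \cdots & 0 & \eta+1 & \cdots & 1 \\
\vdots & & \vdots & \vdots & & \vdots & \vdots & \ddots & \vdots \\
1 & \cdots & 1 & 0 & \cdots & 0 & 1 & \cdots & \eta+1
\end{bmatrix}.
$$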



The convexity of $d(s_i)$ depends on whether its Hessian
matrix $\nabla^2 d$, i.e., the matrix $A$, is positive definite or
not [34]. Meanwhile, the $K \times K$ matrix $A$ is positive
definite if and only if $z^T A z > 0$ for all nonzero vectors
$z \in \mathbb{R}^K$ [35], where $z^T$ denotes the transpose. Let the
size of the upper left matrix in $A$ be $t \times t$, and suppose
$z = [z_1, \cdots, z_t, z_{t+1}, \cdots, z_{t+k}, z_{t+k+1}, \cdots, z_K]^T$.
Then, we have

$$
Az = \begin{bmatrix}
(\eta+1) z_1 + z_2 + \cdots + z_t + z_{t+k+1} + \cdots + z_K \\
\vdots \\
z_1 + z_2 + \cdots + (\eta+1) z_t + z_{t+k+1} + \cdots + z_K \\
(\eta-1) z_{t+1} - z_{t+2} - \cdots - z_{t+k} \\
\vdots \\
-z_{t+1} - z_{t+2} - \cdots + (\eta-1) z_{t+k} \\
z_1 + z_2 + \cdots + z_t + (\eta+1) z_{t+k+1} + \cdots + z_K \\
\vdots \\
z_1 + z_2 + \cdots + z_t + z_{t+k+1} + \cdots + (\eta+1) z_K
\end{bmatrix}.
$$

Furthermore, we can get
Fig. 5. The similarity matrices for sparse representations of (a) RICA, (b) kRICA and (c) d-kRICA.



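$$
\begin{aligned}
z^T A z ={}& (\eta+1)\sum_{i=1}^{t} z_i^2 + (\eta-1)\sum_{i=t+1}^{t+k} z_i^2 + (\eta+1)\sum_{i=t+k+1}^{K} z_i^2 \\
&+ 2\sum_{1 \le i < j \le t} z_i z_j + 2\sum_{\substack{1 \le i \le t \\ t+k+1 \le j \le K}} z_i z_j + 2\sum_{t+k+1 \le i < j \le K} z_i z_j - 2\sum_{t+1 \le i < j \le t+k} z_i z_j \\
={}& \eta\Big(\sum_{i=1}^{t} z_i^2 + \sum_{i=t+k+1}^{K} z_i^2\Big) + (z_1 + \cdots + z_t + z_{t+k+1} + \cdots + z_K)^2 \\
&+ (\eta-1)\sum_{i=t+1}^{t+k} z_i^2 - 2\sum_{t+1 \le i < j \le t+k} z_i z_j .
\end{aligned}
$$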


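Define the function $h(\eta) = z^T A z$. Since $h$ is linear in
$\eta$ with slope $\sum_{i=1}^{K} z_i^2 > 0$, when $\eta > k + 1$ it is
easy to verify that

$$
\begin{aligned}
h(\eta) > h(k+1) ={}& (k+1)\Big(\sum_{i=1}^{t} z_i^2 + \sum_{i=t+k+1}^{K} z_i^2\Big) + (z_1 + \cdots + z_t + z_{t+k+1} + \cdots + z_K)^2 \\
&+ k\sum_{i=t+1}^{t+k} z_i^2 - 2\sum_{t+1 \le i < j \le t+k} z_i z_j \\
={}& k\Big(\sum_{i=1}^{t} z_i^2 + \sum_{i=t+k+1}^{K} z_i^2\Big) + (z_1 + \cdots + z_t + z_{t+k+1} + \cdots + z_K)^2 \\
&+ \sum_{i=1}^{K} z_i^2 + \sum_{t+1 \le i < j \le t+k} (z_i - z_j)^2 .
\end{aligned}
$$

Since $\sum_{i=1}^{K} z_i^2 > 0$, we have $h(\eta) > h(k+1) > 0$. Thus, the
Hessian matrix $\nabla^2 d$ is positive definite for $\eta > k + 1$,
which guarantees that $d(s_i)$ is convex with respect to $s_i$.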
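
The positive-definiteness claim can be checked numerically. The sketch below is an illustration under assumptions, not part of the proof: it builds $A$ from the block structure read off the expression for $Az$ (entries $+1$ between coordinates outside the class block, $-1$ inside the $k \times k$ class block, and $\eta$ added on the diagonal) and tests positive definiteness by attempting a Cholesky factorization. The sizes K, t, k and all function names are hypothetical.

```python
import math


def build_A(K, t, k, eta):
    """A = eta*I + out*out^T - inn*inn^T; indices t..t+k-1 form the class block."""
    inn = [1.0 if t <= i < t + k else 0.0 for i in range(K)]
    out = [1.0 - v for v in inn]
    return [[(eta if i == j else 0.0) + out[i] * out[j] - inn[i] * inn[j]
             for j in range(K)] for i in range(K)]


def is_positive_definite(A, tol=1e-12):
    """Cholesky succeeds exactly when the symmetric matrix A is positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            if i == j:
                if s <= tol:
                    return False  # non-positive pivot: not positive definite
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def quadratic_form(A, z):
    n = len(z)
    return sum(z[i] * A[i][j] * z[j] for i in range(n) for j in range(n))


K, t, k = 7, 2, 3
A_large = build_A(K, t, k, eta=k + 2.0)   # eta above the bound k + 1
A_small = build_A(K, t, k, eta=k - 0.5)   # eta below k
z_class = [1.0 if t <= i < t + k else 0.0 for i in range(K)]
```

With these sizes, `A_large` passes the check while `A_small` fails it, and the indicator vector of the class block, `z_class`, gives a negative quadratic form for `A_small`.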